1.0 Initialization

1.1 Installing Required Python Packages

1.2 Importing Required Python Packages

The following packages need to be imported:

NumPy - For mathematical processing and array handling

Pandas - For handling the datasets as dataframes and carrying out related operations

Matplotlib and Seaborn - For visualization and plotting of data

1.3 Loading Building Features and Target Earthquake Damage Datasets

The target and input datasets are loaded as Pandas dataframes. The dataset paths need to be written as raw strings, by adding an "r" before the opening quote, so that backslashes in the path are not interpreted as escape characters and the files load properly.
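A minimal sketch of this loading step is shown here. The file names, the helper `load_datasets`, and the tiny stand-in files are assumptions made for illustration; the real paths and contents of the datasets are not given in this document.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_datasets(features_path, labels_path):
    """Load the building features and the damage labels as Pandas dataframes."""
    features = pd.read_csv(features_path)
    labels = pd.read_csv(labels_path)
    return features, labels


# A raw string keeps every backslash literal, which matters for Windows-style paths.
# The path below is hypothetical and is only compared, never opened.
assert r"C:\data\train_values.csv" == "C:\\data\\train_values.csv"

# Demonstration on two tiny stand-in files (hypothetical names and values).
with tempfile.TemporaryDirectory() as tmp:
    features_file = Path(tmp) / "train_values.csv"
    labels_file = Path(tmp) / "train_labels.csv"
    features_file.write_text("building_id,age\n1,10\n2,25\n")
    labels_file.write_text("building_id,damage_grade\n1,2\n2,3\n")
    features_df, labels_df = load_datasets(features_file, labels_file)

print(features_df.shape, labels_df.shape)
```

Without the "r" prefix, a sequence such as "\t" inside a path would be read as a tab character, which is one common reason a path fails to load.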

2.0 Initial Analysis of Data

2.1 Displaying The First and Last 8 Rows of the Input and Target Datasets
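A short sketch of how the first and last 8 rows can be displayed with Pandas. The dataframe `demo_df` and its values are made up for illustration and stand in for the real input and target datasets.

```python
import pandas as pd

# Stand-in dataframe with 20 rows (made-up values).
demo_df = pd.DataFrame({"building_id": range(1, 21), "age": range(0, 100, 5)})

first_rows = demo_df.head(8)  # first 8 rows
last_rows = demo_df.tail(8)   # last 8 rows

print(first_rows)
print(last_rows)
```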

2.2 Displaying Descriptive Features of the Datasets

The descriptive statistical characteristics of the input and target datasets are explored. The argument include="all" is used for the input dataset in order to include data from the categorical columns and not just the numerical ones.
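The effect of the include="all" argument can be sketched on a small example. The dataframe `demo_features` and its values are made up for illustration; only the column names "age" and "roof_type" are taken from this analysis.

```python
import pandas as pd

# Stand-in dataframe with one numerical and one categorical column (made-up values).
demo_features = pd.DataFrame({
    "age": [10, 25, 995, 30],
    "roof_type": ["n", "n", "q", "x"],
})

# By default only numerical columns are summarized.
numeric_summary = demo_features.describe()

# include="all" adds categorical columns, with rows such as "unique", "top" and "freq".
full_summary = demo_features.describe(include="all")

print(numeric_summary)
print(full_summary)
```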

From the tables above it can be seen that all columns contain the same number of entries (260601).

Initial observations:

1) The geo_level_3_id column seems to have a very large standard deviation compared to the other columns, but this is because the column contains larger values overall.

2) The average age of the buildings in the dataset is approximately 26 years, and the buildings have an average of 2 floors. However, the building age has a large standard deviation of 73 years.

3) The maximum value in the age column is 995, which may indicate an extremely old building or an error/outlier.

3.0 Data Cleaning, Wrangling, and Preprocessing

The input and target datasets need to be merged for ease of processing, and this can be achieved using the common primary key "building_id". Categorical variables are then encoded into sets of binary columns, one for each value. The presence of null, duplicate and other aberrant data is also explored.

3.1 Merging Input and Target Datasets by Matching Building ID
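A minimal sketch of the merge on "building_id", followed by a check for null and duplicate rows. The two small dataframes and their values are made up for illustration, and the use of an inner join with a one-to-one validation is an assumption about how the merge was set up.

```python
import pandas as pd

# Stand-in input and target dataframes (made-up values, rows in different order).
features_demo = pd.DataFrame({"building_id": [101, 102, 103], "age": [10, 25, 40]})
labels_demo = pd.DataFrame({"building_id": [103, 101, 102], "damage_grade": [3, 1, 2]})

# validate="one_to_one" raises an error if a building_id is repeated on either side.
merged_demo = features_demo.merge(
    labels_demo, on="building_id", how="inner", validate="one_to_one"
)

# Simple checks for missing values and duplicated rows after the merge.
null_count = int(merged_demo.isnull().sum().sum())
duplicate_count = int(merged_demo.duplicated().sum())

print(merged_demo)
print(null_count, duplicate_count)
```

Comparing the row count of the merged dataframe with that of the input dataframe is a quick way to confirm that no building was dropped by the join.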

3.2 Identifying and Handling Categorical Variables
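One way to identify the categorical columns and encode them into binary columns is sketched here with `pandas.get_dummies`. The dataframe and its values are made up, and treating every non-numerical column as categorical is an assumption.

```python
import pandas as pd

# Stand-in dataframe with one numerical and two categorical columns (made-up values).
demo = pd.DataFrame({
    "age": [10, 25, 40],
    "roof_type": ["n", "q", "x"],
    "ground_floor_type": ["f", "v", "f"],
})

# Treat every non-numerical column as categorical.
categorical_columns = [
    col for col in demo.columns if not pd.api.types.is_numeric_dtype(demo[col])
]

# Each category value becomes its own 0/1 column, named <column>_<value>.
encoded = pd.get_dummies(demo, columns=categorical_columns, dtype=int)

print(categorical_columns)
print(encoded)
```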

3.3 Descriptive Features of the Merged Dataset

The descriptive features of the final dataset are explored. This is similar to section 2.2, but is used to ensure that no values have been accidentally modified during the merge.

4.0 Exploratory Data Analysis

4.1 Identifying and Visualizing Distribution

The distributions of some of the continuous numerical variables in the input dataset, along with the distribution and value counts of the damage grades, are explored here.

From the images above, it can be identified that most of the numerical feature columns follow an approximately normal distribution, with the majority of the values at the lower end of the x-axis and a few outliers at higher values.

It can also be noted that the area_percentage and building age distributions are slightly right-skewed.

From the pie chart above, it can be seen that the majority (56.9%) of the buildings have suffered medium damage, 33.5% have been nearly destroyed, and only 9.6% of the buildings have suffered low damage.

Over 90% of the buildings have suffered medium damage to complete destruction, and this research aims to identify how this number can be reduced.
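The percentages behind a chart of this kind can be computed from the value counts of the damage grade column. The sketch here uses made-up data, so its numbers do not match the figures quoted in this section; the column name "damage_grade" is an assumption.

```python
import pandas as pd

# Stand-in damage grades (made-up values): 1 = low, 2 = medium, 3 = high.
grades_demo = pd.DataFrame({"damage_grade": [1, 2, 2, 2, 3, 3, 2, 3, 2, 2]})

# Absolute counts per damage grade.
grade_counts = grades_demo["damage_grade"].value_counts().sort_index()

# Share of each damage grade in percent.
grade_share = (
    grades_demo["damage_grade"].value_counts(normalize=True).sort_index() * 100
).round(1)

print(grade_counts)
print(grade_share)
```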

4.2 Visualizing and Identifying Correlation

The correlation between all the different columns, and the correlation between each column and the damage grade, are explored here. Heatmaps are used for visualization purposes.

The following correlations should be noted:

1) The number of floors and the height percentage have a high positive correlation. This is an expected result, as the building height increases with the number of floors.

2) The highest positive correlations between the damage grade and other features are for buildings with ground_floor type "r", ground_floor type "f", and a superstructure of mortar stone. However, these correlation values are less than 0.5 and are not enough to draw firm conclusions from.

3) There is some negative correlation between the damage grade and foundation_type "i", roof_type "x", and ground_floor type "v", but once again these are not strong enough to draw useful conclusions.

4) Some correlation patterns can be seen within the encoded columns of the categorical data. However, this does not provide any useful information for the research questions.
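The correlation values discussed above can be obtained from a correlation matrix, as sketched here on made-up data. The column names "count_floors_pre_eq" and "height_percentage" are assumptions, and the values are illustrative only. The resulting matrix could then be passed to a heatmap function such as Seaborn's `heatmap` for plotting.

```python
import pandas as pd

# Stand-in numerical and encoded columns (made-up values).
corr_demo = pd.DataFrame({
    "count_floors_pre_eq": [1, 2, 2, 3, 4],
    "height_percentage": [3, 5, 6, 7, 10],
    "roof_type_x": [1, 0, 1, 0, 0],
    "damage_grade": [1, 2, 2, 3, 3],
})

# Pairwise Pearson correlation between all columns.
corr_matrix = corr_demo.corr()

# Correlation of every feature with the damage grade, strongest positive first.
target_corr = (
    corr_matrix["damage_grade"].drop("damage_grade").sort_values(ascending=False)
)

print(corr_matrix.round(2))
print(target_corr.round(2))
```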

Research Question 1: Does the land surface condition that the building is constructed on affect the damage caused by an earthquake?

RQ1.1: Identifying and Visualizing Damage Grade Vs. Land Surface Condition

RQ1.2: Insights

From the visualizations above, at first glance it might seem that buildings with a land surface condition of type "t" have the highest count of buildings that have sustained medium damage or been completely destroyed. However, when looking through the value counts it can be identified that this is due to the large percentage of buildings having condition "t". From the value counts it can be seen that 35528 buildings have a land surface condition of "n", 8316 have a condition of "o" and 216757 have a land condition of "t".

Due to the varying quantities of data for each category, the percentage of buildings for each damage grade was calculated and visualized in order to get a better understanding. From this it can be seen that buildings with land type "o" have a slightly higher chance (36.15%) of being completely destroyed than the other two land types.

It can also be seen that land type "t" has a higher percentage of buildings with low damage (10.13%) than the other types. Since the largest number of buildings have a "t" ground condition, this suggests that "t" ground conditions offer better protection from medium and high earthquake damage.
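The per-category percentages used in this comparison can be computed with a row-normalized cross-tabulation, sketched here on made-up data. The column names "land_surface_condition" and "damage_grade" are assumptions, and the numbers do not correspond to the results reported above.

```python
import pandas as pd

# Stand-in data (made-up values) with unequal group sizes, as in the real dataset.
land_demo = pd.DataFrame({
    "land_surface_condition": ["n", "n", "n", "n", "o", "o", "t", "t", "t", "t", "t", "t"],
    "damage_grade": [1, 2, 2, 3, 2, 3, 1, 2, 2, 2, 3, 3],
})

# Number of buildings per land surface condition.
counts = land_demo["land_surface_condition"].value_counts()

# Percentage of each damage grade within each land surface condition.
# normalize="index" makes every row sum to 1 before scaling to percent.
grade_pct = pd.crosstab(
    land_demo["land_surface_condition"],
    land_demo["damage_grade"],
    normalize="index",
) * 100

print(counts)
print(grade_pct.round(2))
```

Normalizing within each category removes the effect of the very different group sizes, so the rows can be compared directly.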